STA 214 Codebook
Use: The purpose of this codebook is to summarize key code that we use in R to visualize and explore data, fit models, assess assumptions, and test hypotheses. I will update the codebook periodically throughout the semester as we learn new topics.
Many tasks in R, such as visualizing data, can be done in multiple
different ways. Throughout this codebook I will tend to use the
tidyverse for visualizing and cleaning data, but be aware
that other options exist.
Installing packages
Many functions and datasets we want to use in R are provided in
packages. An important package for STA 214 will be the
tidyverse package, which is really a collection of R
packages with a common philosophy on how to work with data.
To install tidyverse, run the following code in your R
console (not in an R Markdown document!):
install.packages("tidyverse")To install other packages, simply replace the name
tidyverse with the name of the package you want to install.
Note that packages only ever need to be installed once. After a
package is successfully installed, you won’t need to install it again
unless you update R on your computer.
Loading packages
Installing a package downloads it to your computer, but doesn’t make
it immediately available to your R session. If you want to use the
tidyverse package, you will need to load it by
running the following code:
library(tidyverse)(The code is similar for other packages, just replace the name
tidyverse). You will need to include
library(...) calls in the setup chunk of your R Markdown
documents.
Exploratory data analysis (EDA)
In STA 112, you learned tools for exploratory data analysis (EDA), in which we explore features of the data such as the available variables and their relationships. Exploratory data analysis is an important step before we do any model fitting or hypothesis testing, because it gets us familiar with the data and any unusual features. It is hard to fit a sensible model when we don’t know what the data look like!
This section of the codebook includes examples of summarizing the
data, identifying missing values, and univariate and multivariate EDA.
The data used for this section is the penguins dataset,
from the palmerpenguins package. To load the
penguins data, run the following in R (you may need to
install the palmerpenguins package):
library(palmerpenguins)We can then look at the description of the penguins data
provided by the package:
?penguins(this only works for datasets included in packages).
Data size and variables
To peak at the data, we can use the glimpse function to
see the variables and the size:
glimpse(penguins)## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
From this summary, we can see there are 344 rows (each row is one penguin), 8 columns, and we see the first few values in each column. Another way to explore the raw data in RStudio is to run
View(penguins)in your console (not in R Markdown!).
Other functions can also be used to calculate the number of rows, the number of columns, and both the number of rows and number of columns:
nrow(penguins) # number of rows## [1] 344
ncol(penguins) # number of columns## [1] 8
dim(penguins) # dimensions of the data## [1] 344 8
Missing data
Sometimes, our data contains missing values. These are often recorded
as NA in R (for “not available”). You can see when we
glimpsed the penguins data above that there were several
missing values (NAs), and there are probably more in the
rest of the data that wasn’t displayed.
To remove missing rows which contain missing values, we can use the
drop_na function. The following code creates a new
dataset (called penguins_new) without those missing
values:
penguins_new <- penguins %>%
drop_na()How many rows did we remove (i.e., how many rows had missing values)?
We can compare the dimensions of penguins_new to
penguins:
dim(penguins_new)## [1] 333 8
Since penguins_new has 333 rows, we have removed 11 rows
due to missing values.
Removing missing values from specific columns
What if we only want to remove rows with NAs in the
flipper_length_mm column? We can specify that column in the
drop_na function:
penguins_new <- penguins %>%
drop_na(flipper_length_mm)
dim(penguins_new)## [1] 342 8
So, there were 2 rows with missing flipper length values.
A note on pipes (%>% and |>)
Throughout this codebook (and throughout my own code too!), I use the
pipe operator %>%. The pipe just means “take THIS, then
do THAT”. So,
penguins %>%
drop_na(flipper_length_mm)means “take penguins, then drop_na (remove
missing values)”. Pipes can be chained together into a longer sequence
of steps, too:
penguins %>%
drop_na(flipper_length_mm) %>%
dim()## [1] 342 8
I like pipes because I think they often make code cleaner and more readable, but you don’t have to use them. You can collapse the pipe into a nested series of functions (evaluated from the inside out):
dim(drop_na(penguins, flipper_length_mm))## [1] 342 8
Or you can save the intermediate steps:
penguins_new <- drop_na(penguins, flipper_length_mm)
dim(penguins_new)## [1] 342 8
The pipe operator %>% comes from the
magrittr package (the name is a nod to artist Rene Magritte
and his painting The Treachery of Images). Newer versions of R
also include a built-in pipe operator |> which functions
similarly:
penguins |>
drop_na(flipper_length_mm) |>
dim()## [1] 342 8
For simplicity, and because |> is more recent, in
this course I will default to the older pipe %>%.
Types of variables
Here we will deal mainly with categorical and quantitative variables.
- Categorical variables are variables which take on
one of several fixed values. These values generally do not have a
numeric interpretation.
- Examples: gender, favorite food, brand of laptop
- Binary categorical variables have exactly 2 possible values
- Quantitative variables are variables which take on
a numeric value, and which have a numeric interpretation.
- Examples: number of pets, height, weight, age
- Discrete quantitative variables only take on discrete values (e.g., number of pets)
- Continuous quantitative variables can take on an entire range of values (e.g., height is continuous if we allow heights like 60.323 inches)
An overview of ggplot for making plots
The ggplot2 package, which is part of
tidyverse, is a very valuable tool for visualizing data.
The ggplot2 packages provides the ggplot
function, which is my default for making plots. For example, here is
code to create a scatterplot:
penguins %>%
ggplot(aes(x = bill_depth_mm,
y = bill_length_mm,
color = species)) +
geom_point() +
labs(x = 'Bill depth (mm)',
y = 'Bill length (mm)',
color = 'Species',
title = "Penguin bill length vs. bill depth") +
theme_bw()## Warning: Removed 2 rows containing missing values (geom_point).
Notice that ggplot is warning us that we had some
missing values in the data. If we remove those missing values, the
warning will go away.
ggplot layers
The idea behind ggplot is to build the figure in layers.
We start off by specifying the data that we want to use
(penguins), and the variables that we want to plot
(bill_depth_mm, bill_length_mm, and
species). These variables are mapped to the
aesthetics of the plot: features like the x-axis
(x = bill_depth_mm), the y-axis
(y = bill_length_mm), and the color of the points
(color = species). We specify the aesthetics in the
aes. Other aesthetics include shape,
fill, alpha, etc.
Running the first few lines sets up the plot for us to fill in:
penguins %>%
ggplot(aes(x = bill_depth_mm,
y = bill_length_mm,
color = species))Next, we need to decide how to represent the observations in our
plot. Here we have two quantitative variables, so a scatterplot is
reasonable. We therefore add another layer to our plot, representing
each observation as a point (notice that we add layers together with the
+ sign):
penguins %>%
ggplot(aes(x = bill_depth_mm,
y = bill_length_mm,
color = species)) +
geom_point()## Warning: Removed 2 rows containing missing values (geom_point).
The way we represent our data (in this case, points) is with
geometric objects, or geoms. If we want points, we use
geom_point. Other geoms include geom_histogram
(histograms), geom_bar (bar charts),
geom_boxplot (boxplots), among many others. The sections
below give an overview of some of the most important plots.
Finally, we want to make our plot look nice. This involves changing
the axis labels and the legend, adding a title, and changing the theme.
We include axis labels in the labs layer, and we change the
theme in the theme layer. Possible themes include
theme_bw, theme_minimal,
theme_classic, theme_light, and many others. I
typically use theme_bw, but you may choose whichver theme
you prefer.
penguins %>%
ggplot(aes(x = bill_depth_mm,
y = bill_length_mm,
color = species)) +
geom_point() +
labs(x = 'Bill depth (mm)',
y = 'Bill length (mm)',
color = 'Species',
title = "Penguin bill length vs. bill depth") +
theme_bw()## Warning: Removed 2 rows containing missing values (geom_point).
Univariate exploratory data analysis
Here we discuss how to summarize, visualize, and describe the distributions of categorical and quantitative variables.
Categorical variables
Summarize
The number of observations in each group can be summarized in a frequency table:
penguins %>%
count(species)## # A tibble: 3 × 2
## species n
## <fct> <int>
## 1 Adelie 152
## 2 Chinstrap 68
## 3 Gentoo 124
Visualize
The same information can be visualized with bar charts, which display the number of observations in each group as the height of the bar:
penguins %>%
ggplot(aes(x = species)) +
geom_bar() +
labs(x = "Species") +
theme_bw()Other visualization options include pie charts.
Describe
- Which category has the most number of observations? The least?
- Are observations spread relatively evenly across categories, or do one or two categories have the majority of the observations?
Quantitative variables
Summarize
Many summary statistics can be calculated for quantitative variables. We often calculate the mean or median to summarize the center of the distribution, and the standard deviation or IQR to summarize the spread of the distribution. If the data are highly skewed, the median and IQR are often more appropriate measures of center and spread.
Note that if NAs (missing values) are present in the data, then we
need to remove them before calculating summary statistics. This can be
done by removing all rows with NAs (drop_na()), or by
ignoring NAs when we calculate the summary statistics
(na.rm=TRUE).
penguins %>%
summarize(mean_mass = mean(body_mass_g, na.rm=TRUE),
median_mass = median(body_mass_g, na.rm=TRUE),
sd_mass = sd(body_mass_g, na.rm=TRUE),
iqr_mass = IQR(body_mass_g, na.rm=TRUE))## # A tibble: 1 × 4
## mean_mass median_mass sd_mass iqr_mass
## <dbl> <dbl> <dbl> <dbl>
## 1 4202. 4050 802. 1200
Visualize
A good choice for visualize the distribution of a quantitative
variable is with a histogram. A histogram divides the
range of the data into evenly spaced bins, and then displays the number
of observations which fall into each bin. Since the number of bins
affects how the histogram looks, it is good practice to experiment with
several different numbers of bins. This can be specified with the
bins argument in geom_histogram.
penguins %>%
ggplot(aes(x = body_mass_g)) +
geom_histogram(bins = 20) +
labs(x = "Body mass (g)") +
theme_bw()Another common option for visualization is the boxplot. A boxplot doesn’t show the whole distribution, but rather a summary of it. In particular, it displays the median, first and third quartiles, the smallest and largest non-outlier values, and any outliers.
penguins %>%
ggplot(aes(y = body_mass_g)) +
geom_boxplot() +
labs(y = "Body mass (g)") +
theme_bw()Other tools include density plots (geom_density) and
violin plots (geom_violin).
Describe
- Shape (symmetric vs. skewed, number of modes, location of modes)
- Center (usually mean or median)
- Spread (usually standard deviation or IQR)
- Any unusual features?
- Any potential outliers?
Bivariate exploratory data analysis
What if we want to look at the relationship between two variables?
Two categorical variables
Summarize
We can count the number of observations in each group:
penguins %>%
count(species, island)## # A tibble: 5 × 3
## species island n
## <fct> <fct> <int>
## 1 Adelie Biscoe 44
## 2 Adelie Dream 56
## 3 Adelie Torgersen 52
## 4 Chinstrap Dream 68
## 5 Gentoo Biscoe 124
Sometimes, it is nice to display the result as a two-way table, where categories for one variable are in the rows, and categories for the second variable are in the columns:
penguins %>%
count(species, island) %>%
spread(island, n)## # A tibble: 3 × 4
## species Biscoe Dream Torgersen
## <fct> <int> <int> <int>
## 1 Adelie 44 56 52
## 2 Chinstrap NA 68 NA
## 3 Gentoo 124 NA NA
(Note that here, NA means that this combination of values did not appear in the dataset. So, e.g., there were no Chinstrap penguins from Biscoe island).
Visualize
A common way to visualize the relationship between two categorical variables is with a stacked bar graph:
penguins %>%
ggplot(aes(x = species, fill = island)) +
geom_bar() +
labs(x = "Species",
fill = "Island") +
theme_bw()Other options include mosaic plots.
Describe
- Which combination of categories has the most observations? The least?
- Are there any combinations which do not appear in the data?
- Is the distribution for the second variable the same for each level
of the first variable? (E.g., in the
penguinsexample above, there appears to be a relationship between species and island, because the distribution of penguins in each island is different for the three species. Adelie penguins are found on all three islands, whereas Chinstrap and Gentoo penguins are only on one).
Two quantitative variables
Visualize
To visualize the relationship between two quantitative variables, we can use a scatterplot:
penguins %>%
ggplot(aes(x = flipper_length_mm,
y = body_mass_g)) +
geom_point() +
labs(x = "Flipper length (mm)",
y = "Body mass (g)") +
theme_bw()Summarize
If the relationship looks linear, we can calculate the sample correlation coefficient, \(r\), to summarize the strength of the linear relationship. Recall that \(r\) takes values between -1 and 1, with \(r = -1\) a very strong negative relationship, \(r = 0\) no relationship, and \(r = 1\) a very strong positive relationship.
When calculating the correlation, we have to handle NAs, if missing
values are present in the data. This can be done either by removing all
rows with NAs before hand (drop_na()), or by ignoring NAs
when computing correlation (use = "complete.obs").
penguins %>%
summarize(r = cor(flipper_length_mm,
body_mass_g,
use="complete.obs"))## # A tibble: 1 × 1
## r
## <dbl>
## 1 0.871
Describe
- does there appear to be a relationship?
- if so, does the relationship appear to be positive or negative?
- what is the general shape of the relationship? Does it look linear?
- if the relationship looks linear, report the sample correlation coefficient
One categorical, one quantitative
Visualize
There are several options for visualizing the relationship between a categorical and a quantitative variable. A common choice is to make a boxplot for each level of the categorical variable:
penguins %>%
ggplot(aes(x = species, y = body_mass_g)) +
geom_boxplot() +
labs(x = "Species", y = "Body mass (g)") +
theme_bw()While boxplots are just summaries of a distribution, they are very handy for comparing across groups.
Another option, if the number of categories isn’t too large, is to create a histogram faceted by the categorical variable:
penguins %>%
ggplot(aes(x = body_mass_g)) +
geom_histogram(bins = 20) +
facet_wrap(~species) +
labs(x = "Body mass (g)") +
theme_bw()Summarize
To summarize the relationship, we can calculate summary statistics
for the quantitative variable at each level of the categorical variable.
The group_by function is very helpful here.
penguins %>%
group_by(species) %>%
summarize(mean_mass = mean(body_mass_g, na.rm=TRUE),
median_mass = median(body_mass_g, na.rm=TRUE))## # A tibble: 3 × 3
## species mean_mass median_mass
## <fct> <dbl> <dbl>
## 1 Adelie 3701. 3700
## 2 Chinstrap 3733. 3700
## 3 Gentoo 5076. 5000
Describe
Is the distribution of the quantitative variable different across levels of the categorical variable? If so, how? (e.g., differences in shape, center, spread)
More than two variables
With more than two variables, you can get a lot of combinations. Here are just a couple examples. Using additional aesthetics and faceting is helpful for visualization. Using grouping is helpful for summary statistics.
Quantitative, quantitative, categorical
penguins %>%
ggplot(aes(x = bill_depth_mm,
y = body_mass_g,
color = species)) +
geom_point() +
labs(x = "Bill depth (mm)",
y = "Body mass (g)",
color = "Species") +
theme_bw()penguins %>%
group_by(species) %>%
summarize(r = cor(bill_depth_mm,
body_mass_g,
use="complete.obs"))## # A tibble: 3 × 2
## species r
## <fct> <dbl>
## 1 Adelie 0.576
## 2 Chinstrap 0.604
## 3 Gentoo 0.719
Quantitative, categorical, categorical
penguins %>%
ggplot(aes(x = island,
y = body_mass_g)) +
geom_boxplot() +
facet_wrap(~species) +
labs(x = "Island",
y = "Body mass (g)") +
theme_bw()penguins %>%
ggplot(aes(x = body_mass_g)) +
geom_histogram(bins=15) +
facet_grid(island~species) +
labs(x = "Body mass (g)") +
theme_bw()penguins %>%
group_by(species, island) %>%
summarize(mean_mass = mean(body_mass_g, na.rm=TRUE),
median_mass = median(body_mass_g, na.rm=TRUE))## # A tibble: 5 × 4
## # Groups: species [3]
## species island mean_mass median_mass
## <fct> <fct> <dbl> <dbl>
## 1 Adelie Biscoe 3710. 3750
## 2 Adelie Dream 3688. 3575
## 3 Adelie Torgersen 3706. 3700
## 4 Chinstrap Dream 3733. 3700
## 5 Gentoo Biscoe 5076. 5000
Logistic regression
In this section, we will continue to use the penguins
data from the palmerpenguins package (see the EDA section
above for instructions on loading the data).
Two components of a parametric model
In the penguins data, sex is a categorical
variable which is recorded as either male or female. Suppose we want to
predict penguin sex based on physical characteristics like flipper
length and body mass. Let \(Y_i = 0\)
if sex = female, and \(Y_i = 1\) if sex
= 1 (I am coding male as 1 here because by default, R will order the
levels of sex alphabetically, so female and then male).
Then we might propose the following population logistic
regression model:
\[Y_i \sim Bernoulli(\pi_i)\] \[\log \left( \dfrac{\pi_i}{1 - \pi_i} \right) = \beta_0 + \beta_1 FlipperLength_i + \beta_2 BodyMass_i\] Here \(\pi_i\) denotes the probability that \(Y_i = 1\), and the log odds are a linear function of flipper length and body mass.
The first line of this model is called the random component: it specifies the distribution of the random variable \(Y_i\). The second line is called the systematic component: it specifies how \(\pi_i\) is related to the explanatory variables.
Fitting logistic regression models
To fit the above logistic regression model in R, we use the
glm function:
m1 <- glm(sex ~ flipper_length_mm + body_mass_g,
data = penguins, family = binomial)
summary(m1)##
## Call:
## glm(formula = sex ~ flipper_length_mm + body_mass_g, family = binomial,
## data = penguins)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9901 -0.9598 0.2442 0.8898 2.1068
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.0868039 2.6621921 2.662 0.00777 **
## flipper_length_mm -0.0932568 0.0199664 -4.671 3.00e-06 ***
## body_mass_g 0.0027820 0.0003923 7.092 1.32e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 461.61 on 332 degrees of freedom
## Residual deviance: 371.98 on 330 degrees of freedom
## (11 observations deleted due to missingness)
## AIC: 377.98
##
## Number of Fisher Scoring iterations: 4
The glm function stands for generalized linear
model. We specify the response variable on the left hand side of
the ~, and the explanatory variables on the right hand
side. To fit logistic regression, we specify
family = binomial.
The summary function summarizes our fitted model. We can
see that \(\widehat{\beta}_0 = 7.087\),
\(\widehat{\beta}_1 = -0.093\), and
\(\widehat{\beta}_2 = 0.003\). Our
fitted logistic regression model is therefore
\[Y_i \sim Bernoulli(\pi_i)\] \[\log \left( \dfrac{\widehat{\pi}_i}{1 - \widehat{\pi}_i} \right) = 7.087 -0.093 FlipperLength_i + 0.003 BodyMass_i\]
Note that the equation of the fitted model includes hats on the \(\pi_i\).
Interpreting regression coefficients
The fitted coefficients above can be interpreted on the log odds scale, or on the odds scale.
Log odds:
- The estimated log odds that a penguin is male when flipper length and body mass are both 0 is 7.087. (Since all penguins have flippers and no penguins weigh 0g, this interpretation doesn’t mean much)
- A one mm increase in flipper length is associated with an estimated decrease in the log odds that a penguin is male by 0.093, holding body mass constant
- A one g increase in body mass is associated with an estimated increase in the log odds that a penguin is male by 0.003, holding flipper length constant
Odds:
- The estimated log odds that a penguin is male when flipper length and body mass are both 0 is \(\exp\{7.087\} = 1196\). (Since all penguins have flippers and no penguins weigh 0g, this interpretation doesn’t mean much)
- A one mm increase in flipper length is associated with an estimated decrease in the log odds that a penguin is male by a factor of \(\exp\{-0.093\} = 0.911\), holding body mass constant
- A one g increase in body mass is associated with an estimated increase in the log odds that a penguin is male by a factor of \(\exp\{0.003\} = 1.003\), holding flipper length constant
Empirical logit plots
The following R function can be used to create empirical logit plots:
logodds_plot <- function(data, num_bins, bin_method,
x_name, y_name, grouping = NULL,
reg_formula = y ~ x){
if(is.null(grouping)){
dat <- data.frame(x = data %>% pull(x_name),
y = data %>% pull(y_name),
group = 1)
} else {
dat <- data.frame(x = data %>% pull(x_name),
y = data %>% pull(y_name),
group = as.factor(data %>% pull(grouping)))
}
if(bin_method == "equal_size"){
logodds_table <- dat %>%
drop_na() %>%
arrange(group, x) %>%
group_by(group) %>%
mutate(obs = y,
bin = rep(1:num_bins,
each=ceiling(n()/num_bins))[1:n()]) %>%
group_by(bin, group) %>%
summarize(mean_x = mean(x),
prop = mean(c(obs, 0.5)),
num_obs = n()) %>%
ungroup() %>%
mutate(logodds = log(prop/(1 - prop)))
} else {
logodds_table <- dat %>%
drop_na() %>%
group_by(group) %>%
mutate(obs = y,
bin = cut(x,
breaks = num_bins,
labels = F)) %>%
group_by(bin, group) %>%
summarize(mean_x = mean(x),
prop = mean(c(obs, 0.5)),
num_obs = n()) %>%
ungroup() %>%
mutate(logodds = log(prop/(1 - prop)))
}
if(is.null(grouping)){
logodds_table %>%
ggplot(aes(x = mean_x,
y = logodds)) +
geom_point(size=2) +
geom_smooth(se=F, method="lm", formula = reg_formula) +
theme_bw() +
labs(x = x_name,
y = "Empirical log odds") +
theme(text = element_text(size=15))
} else {
logodds_table %>%
ggplot(aes(x = mean_x,
y = logodds,
color = group,
shape = group)) +
geom_point(size=2) +
geom_smooth(se=F, method="lm", formula = reg_formula) +
theme_bw() +
labs(x = x_name,
y = "Empirical log odds",
color = grouping,
shape = grouping) +
theme(text = element_text(size=15))
}
}Arguments:
data: the dataset of interest (e.g.,titanic)num_bins: the number of bins to use- The number of bins should be chosen based on the size of the data.
For example, with
bin_method = "equal_size", we probably want at least 15 observations per bin, sonum_bins< (number of observations)/15
- The number of bins should be chosen based on the size of the data.
For example, with
bin_method: whether to choose bins with the same number of observations ("equal_size") or the same width ("equal_width")x_name: the name of the column containing the explanatory variable (e.g.,"Fare"). The quotation marks are neededy_name: the name of the column containing the response (e.g.,"Survived"). The quotation marks are neededgrouping: the name of a categorical variable in the data; fit a separate line for each level of the categorical variablereg_formula: This is the shape of the relationship you want to plot. If you want a line, this is y ~ x (the default). Some other examples:y ~ log(x): a log transformation of xy ~ sqrt(x): a square root transformation of xy ~ poly(x, 2): a second-order polynomialy ~ poly(x, 3): a third-order polynomial
Examples
The titanic dataset contains a number of variables
recorded for passengers on the Titanic when it sunk on its
first voyage. The response variable we care about with the
titanic data is Survived: whether a passenger
survived the disaster or not. Let’s first look at Fare as
an explanatory variable. We will make an empirical logit plot with 30
equally sized bins:
library(tidyverse)
logodds_plot(titanic, 30, "equal_size", "Fare", "Survived",
reg_formula = y ~ x)That linear fit doesn’t look very good! Maybe we should try a different shape. Let’s try a quadratic fit instead:
logodds_plot(titanic, 30, "equal_size", "Fare", "Survived",
reg_formula = y ~ poly(x, 2))And maybe a log transformation too:
logodds_plot(titanic, 30, "equal_size", "Fare", "Survived",
reg_formula = y ~ log(x))Quantile residual plots
Quantile residual plots can be created using the
qresid() function of the statmod package. For
example, suppose we fit a logistic regression on the
titanic data with Survival as the response and Fare as the
explanatory variable:
m1 <- glm(Survived ~ Fare, data = titanic, family = binomial)Let’s assess whether the shape assumption is reasonable for Fare with a quantile residual plot:
library(statmod)
titanic %>%
mutate(residuals = qresid(m1)) %>%
ggplot(aes(x = Fare, y = residuals)) +
geom_point() +
geom_smooth() +
theme_bw()## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The results aren’t amazing…what if we try transforming Fare? Let’s try a square root:
m2 <- glm(Survived ~ sqrt(Fare), data = titanic, family = binomial)
titanic %>%
mutate(residuals = qresid(m2)) %>%
ggplot(aes(x = sqrt(Fare), y = residuals)) +
geom_point() +
geom_smooth() +
theme_bw()## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Looks a bit better!
R structures, loops, and simulation
vectors
A vector is a simple way of storing multiple items. For example, here is a vector containing the numbers 2, 1, 4:
v <- c(2, 1, 4)
v## [1] 2 1 4
The c() in R means “combine” or “concatenate” the
elements of the vector together. The items inside the parentheses are
the elements of the vector.
Length
Every vector has a length, which tells us how many elements it contains:
length(v)## [1] 3
Indexing
We can access each item in the vector by its index. That is, we specify the position of the vector, and get back the element at that position:
v[1]## [1] 2
v[2]## [1] 1
v[3]## [1] 4
We specify the index inside the square brackets [...].
For example, v[2] is the second element of the vector
v.
Vectors can store elements other than numbers. For example, here is a vector containing the letters ‘a’, ‘b’, ‘c’, and ‘d’:
w <- c('a', 'b', 'c', 'd')
w## [1] "a" "b" "c" "d"
We can combine two vectors together with the c()
function:
c(v, w)## [1] "2" "1" "4" "a" "b" "c" "d"
Sequences (seq)
Suppose we want to create a vector containing the numbers 0, 0.1,
0.2,…,0.9, 1. In other words, a sequence of numbers between 0 and 1,
incremented by steps of 0.1. We can use the seq
function:
seq(from=0, to=1, by=0.1)## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Here’s a sequence from 1 to 10, incremented by steps of 1:
seq(from=1, to=10, by=1)## [1] 1 2 3 4 5 6 7 8 9 10
When we want to increment by steps of 1, R has a handy shorthand:
1:10## [1] 1 2 3 4 5 6 7 8 9 10
Repetitions (rep)
Other times, we want to create a vector containing one value repeated
many times. We can use the rep function. For example,
suppose we want to create a vector of eleven 0s:
rep(0, 11)## [1] 0 0 0 0 0 0 0 0 0 0 0
for loops
Suppose we want to repeat a process many times (e.g., simulate many
sets of data). A for loop allows us to do this efficiently.
Here is a simple for loop which prints out the number 214
five times:
for(i in 1:5){
print(214)
}## [1] 214
## [1] 214
## [1] 214
## [1] 214
## [1] 214
for(...) indicates that we are making a for
loop. The code inside the curly braces { ... } is the code
which will get repeated. The index i here tells us how many
times we run this code. Remember that 1:5 is R shorthand
for 1,2,3,4,5. So i in 1:5 means “run the code inside the
loop for i=1, i=2, i=3,
i=4, and i=5”.
The really neat thing about a for loop is that what we
do inside the loop can depend on the index i! For example,
suppose we want to print out the numbers 1 to 5, instead of 214:
for(i in 1:5){
print(i)
}## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
Example: calculating a likelihood
Suppose we have a coin with an unknown probability of heads \(\pi\). Let \(Y_i\) denote the outcome of a flip, with \(Y_i = 0\) for tails and \(Y_i = 1\) for heads. Then \(Y_i \sim Bernoulli(\pi)\).
Now suppose we observe five flips of the coin: T, T, T, T, H (i.e., 0, 0, 0, 0, 1). Given this observed data, the likelihood of a value for \(\pi\) is
\[L(\pi) = \pi(1 - \pi)^4\]
We want to evaluate \(L(\pi)\) for
different values of \(\pi\) between 0
and 1; say, \(\pi = 0, 0.1, 0.2, ..., 0.9,
1\). We can do this calculation efficiently with a
for loop:
pis <- seq(from=0, to=1, by=0.1) # values of pi to consider
likelihoods <- c() # empty vector to store the likelihoods we calculate
for(i in 1:length(pis)){
# likelihood = pi*(1 - pi)^4, for each value of pi
likelihoods[i] <- pis[i]*(1 - pis[i])^4
}
likelihoods## [1] 0.00000 0.06561 0.08192 0.07203 0.05184 0.03125 0.01536 0.00567 0.00128
## [10] 0.00009 0.00000
Note: R does a good job at vectorizing
functions (when you apply a function to a vector, it gets applied to
every entry in that vector). So we could write this code without the
for loop:
pis <- seq(from=0, to=1, by=0.1)
likelihoods <- pis*(1 - pis)^4
likelihoods## [1] 0.00000 0.06561 0.08192 0.07203 0.05184 0.03125 0.01536 0.00567 0.00128
## [10] 0.00009 0.00000
Some useful notation and math
Sum and product notation
At its core, statistics relies on math, and math often uses notation to make equations and expressions simpler. Some notation that we will often encounter in STA 214 is sum (\(\sum\)) and product (\(\prod\)) notation.
Suppose we have \(n\) observations, \(Y_1,...,Y_n\). We want to take the sum of these observations: \[Y_1 + Y_2 + Y_3 + \cdots + Y_n\] Great! But this requires us to write out each of the terms, or use \(\cdots\) to express that there are terms we haven’t written out. Regardless, it is a little clunky. We can make this neater by using \(\sum\) as a shorthand for “sum” (\(\Sigma\) is the capital sigma in Greek, so think “S for Sum”). Using this notation, \[\sum \limits_{i=1}^n Y_i = Y_1 + Y_2 + \cdots + Y_n\] This shorthand means “sum up all the \(Y_i\), starting with \(i = 1\) and going to \(i = n\)”. In other words, sum up \(Y_1,...,Y_n\).
What about products? The product of \(Y_1,...,Y_n\) is \[Y_1 Y_2 \cdots Y_n\] which we can write more cleanly as \(\prod \limits_{i=1}^n Y_i\). Here \(\prod \limits_{i=1}^n\) just means “multiply these \(n\) things together”. (\(\Pi\) is the capital pi in Greek, so think “P for Product”)
Logs and properties of logs
Recall that logarithms (logs for short) are what we use to undo exponents. That is, \(\log(x) = y\) means that \(x = e^{y}\). (By default, \(\log\) in math and statistics is the natural log. Equivalent definitions exist for other bases: \(\log_{10}(x) = y\) means that \(x = 10^{y}\)).
There are some useful properties of logs that we use in STA 214, particularly for maximum likelihood estimation:
Logs turn products into sums: \[\log(a \cdot b) = \log(a) + \log(b)\] More generally, \[\log \left( \prod \limits_{i=1}^n a_i \right) = \sum \limits_{i=1}^n \log(a_i)\] Likewise, \[\log(a/b) = \log(a) - \log(b)\]
Logs turn exponents into multiples: \[\log(x^a) = a \log(x)\]
Logs have nice derivatives: \[ \dfrac{d}{dx} \log(x) = \dfrac{1}{x}\]
Calculus review: differentiation
Recall the following rules for derivative:
- \(\dfrac{d}{dx} c f(x) = c \dfrac{d}{dx} f(x)\) for constant \(c\) More generally
- \(\dfrac{d}{dx} (f(x) + c) = \dfrac{d}{dx} f(x)\) for constant \(c\)
- The derivative of a sum is the sum of derivatives: \[\dfrac{d}{dx} (f(x) + g(x)) = \dfrac{d}{dx} f(x) + \dfrac{d}{dx} g(x)\] More generally, this means that derivatives can move inside \(\sum\): \[\dfrac{d}{dx} \sum \limits_{i=1}^n f_i(x) = \sum \limits_{i=1}^n \dfrac{d}{dx} f_i(x)\]
- Chain rule: \[\dfrac{d}{dx} f(g(x)) = f'(g(x)) g'(x)\]
Writing math in R Markdown
If you want to write mathematical notation, we need to tell Markdown,
“Hey, we’re going to make a math symbol!” To do that, you use dollar
signs. For instance, to make \(\widehat{\beta}_1\), you simply put
$\widehathat{\beta}_1$ into the white space (not a chunk)
in your Markdown.
Here are some examples of writing math, which you can adapt:
| Math | Code |
|---|---|
| \(Y_i \sim Bernoulli(\pi_i)\) | $Y_i \sim Bernoulli(\pi_i)$ |
| \(\log \left( \dfrac{\pi_i}{1 - \pi_i} \right)\) | $\log \left( \dfrac{\pi_i}{1 - \pi_i} \right)$ |
| \(\widehat{\pi}_i\) | $\widehat{\pi}_i$ |
| \(\beta_0 + \beta_1 X_i\) | $\beta_0 + \beta_1 X_i$ |
| \(\sum \limits_{i=1}^n\) | $\sum \limits_{i=1}^n$ |
| \(\prod \limits_{i=1}^n\) | $\prod \limits_{i=1}^n$ |
This
work was created by Ciaran Evans and is licensed under a
Creative
Commons Attribution-NonCommercial 4.0 International License. Last
updated 2022 March 25.